Welcome to my humble landing page and to the Introduction to Open Data Science course, which I'm taking together with the others!
Data science is growing in importance, and on this course we will learn practical skills related to it and, of course, have fun while doing it!
Here is a link to my GitHub repository
This week I’ve worked with wrangling, tidying and interpreting questionnaire data and learned how to plot and test linear regression models. Seems like attitude is a main factor affecting the test scores among the variables included in the study. The week has been a good refresher of the previous statistics courses and I have learned more about using Rmarkdown for presenting the results from R.
The data in question come from a survey of students' approaches to studying (ASSIST) and their attitudes towards statistics (SATS), gathered 3.12.2014 - 10.1.2015. The students' approaches to studying were divided into deep, strategic and surface categories. Here are the descriptions from a presentation by Kimmo Vehkalahti:
Surface approach: memorizing without understanding, with a serious lack of personal engagement in the learning process.
Deep approach: intention to maximize understanding, with a true commitment to learning.
Strategic approach: the ways students organize their studying (apply any strategy that maximizes the chance of achieving the highest possible grades).
The points variable states the points received in a statistics exam. Participants with 0 points (meaning they did not attend the exam) were filtered out of the data.
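The filtering step can be sketched in a couple of lines; a minimal example on a hypothetical toy frame (the variable names mirror the dataset, the values are made up):

```r
# Toy data mimicking the structure of learning2014 (values are made up)
df <- data.frame(points   = c(0, 18, 25, 0, 30),
                 attitude = c(2.1, 3.4, 4.0, 1.8, 3.9))
# Keep only participants who attended the exam (points > 0)
kept <- subset(df, points > 0)
nrow(kept)  # 3 rows remain
```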
learning2014 <- read.csv("https://raw.githubusercontent.com/BreezewindX/IODS-project/master/data/learning2014.csv")
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The data consist of seven variables and contain the survey results of 166 persons, of whom only 56 were men (perhaps a somewhat small sample size). There were thus roughly twice as many female as male participants. The participants' ages ranged from 17 to 55 years.
library(GGally)
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
From the matrix we can see that, among the variables included in the study, attitude (0.437) and stra (0.146) correlate most positively with exam points, while surf (-0.144) correlates negatively. Correlation typically refers to how close two variables are to having a linear relationship with each other.
There is not much correlation between the study approaches; the largest is -0.324 between ‘deep’ and ‘surf’ (which is notable, but logical, as a student relying on the surface approach is unlikely to also use the deep approach).
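Pairwise correlations like these come straight from cor(); a small sketch on simulated data (the real values above come from the ggpairs matrix):

```r
set.seed(2014)
# Simulated stand-ins for the numeric survey variables
d <- data.frame(deep = rnorm(100), surf = rnorm(100), points = rnorm(100))
# Pearson correlation matrix, rounded for readability
cm <- round(cor(d), 3)
cm["deep", "surf"]
```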
The distribution of age seems heavily right-skewed, which means that a big portion of the participants were relatively young. The median age for men was a bit higher than for women.
The mean attitude score for men is somewhat higher than for women, and in general not many men scored low on attitude. There seems to be little difference between the genders in the use of the deep approach, while women seem to rely more on the strategic approach.
We’ll choose the following linear model for the regression \[points_i = \beta_1+\beta_2attitude_i+\beta_3stra_i+\beta_4surf_i+\epsilon_i\] where \(\beta_1\) is the intercept and \(\epsilon_i\) is the error term.
# create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
According to the multiple R-squared value, the predictors jointly explain only 0.21 (21%) of the observed variance in the dependent variable. The F-statistic for the regression tells us that we can reject the null hypothesis that all the slope coefficients are jointly zero, so we don't have to abandon the model at this point.
The positive coefficient of 3.40 for the attitude variable is statistically significant even at the 0.001 significance level.
P-values for variables stra and surf are higher than 0.05, which means we cannot, at the 95% confidence level, reject the null hypothesis that their coefficients are zero (ceteris paribus). Thus they are candidates for discarding.
Let's test the joint null hypothesis that the coefficients of stra and surf are both zero, \(H_0:\beta_3=\beta_4=0\).
library(car)
ss_results<- linearHypothesis(my_model, c("stra = 0", "surf = 0"))
print(ss_results)
## Linear hypothesis test
##
## Hypothesis:
## stra = 0
## surf = 0
##
## Model 1: restricted model
## Model 2: points ~ attitude + stra + surf
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 164 4641.1
## 2 162 4544.4 2 96.743 1.7244 0.1815
We get a p-value of 0.182 (>0.05), which means we cannot reject the joint null hypothesis at the 95% confidence level.
Let's now drop these two ‘redundant’ variables. The new regression model is \[points_i = \beta_1 +\beta_2attitude_i+\epsilon_i\]
# create a regression model without redundant variables
new_model <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(new_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now the multiple \(R^2\) is almost the same as earlier, even though we dropped two variables. Since adding variables can only make \(R^2\) grow, losing so little of it confirms that stra and surf contributed little explanatory power, and the simpler model is preferable. The single explanatory variable still explains ~19% of the variance in the exam points.
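The fact that \(R^2\) of a full model is never below that of a nested reduced model can be checked directly; a sketch on simulated data resembling this setup (the coefficients and noise level are made up):

```r
set.seed(42)
n <- 166
attitude <- rnorm(n, 3.1, 0.7)
stra     <- rnorm(n, 3.1, 0.7)
surf     <- rnorm(n, 2.8, 0.5)
# Only attitude has a true effect; stra and surf are pure noise
points <- 11 + 3.4 * attitude + rnorm(n, 0, 5.3)
full    <- lm(points ~ attitude + stra + surf)
reduced <- lm(points ~ attitude)
# The full model's R^2 is at least as large as the reduced model's
c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared)
```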
The assumptions made are:
1. errors are normally distributed
2. errors are not correlated
3. the errors have constant variance (homoscedasticity)
\[\epsilon\sim N(0,\sigma^2) \]
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model, which=c(1,2,5))
Looking at the Q-Q plot, the assumption of normality of the residuals seems reasonable, as most of the points lie on the line, even though some outliers can be seen at the edges.
No obvious pattern emerges in the plot of residuals vs. fitted values (the variance seems constant), and none of the observations has large leverage over the whole regression.
Visual inspection implies that these assumptions are met. It is, however, a bit hard to say for sure that the variance is constant, so a few tests might be in order to confirm this.
We can test for heteroscedasticity using the Breusch-Pagan test, testing against an \(H_0\) of no heteroscedasticity.
library(lmtest)
library(dplyr)
new_model %>%
lmtest::bptest() -> bpresults
print(bpresults)
##
## studentized Breusch-Pagan test
##
## data: .
## BP = 0.0039163, df = 1, p-value = 0.9501
We get a p-value of 0.95, so we cannot reject the null hypothesis: we have found no evidence against homoscedasticity.
The White test is a generalization of the Breusch-Pagan test and may detect more general forms of heteroscedasticity.
whiteresults <- bptest(new_model, ~ attitude + I(attitude^2), data = learning2014)
print(whiteresults)
##
## studentized Breusch-Pagan test
##
## data: new_model
## BP = 0.088873, df = 2, p-value = 0.9565
With the White test we get a p-value of 0.957, which means that we have not found any evidence against the assumptions of our linear regression model.
The data are from two identical questionnaires related to secondary school student alcohol consumption in Portugal. The data attributes include student grades and demographic, social and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided, regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). You can find more information here.
The following adjustments have been made:
1. Variables not used for joining the two datasets have been combined by averaging (including the grade variables)
2. alc_use is the average of Dalc and Walc
3. high_use is TRUE if alc_use is higher than 2 and FALSE otherwise
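Adjustments 2 and 3 can be reproduced in a couple of lines; a sketch on a hypothetical three-student frame (Dalc and Walc are the weekday and weekend consumption variables, on a 1-5 scale):

```r
# Hypothetical weekday (Dalc) and weekend (Walc) consumption, scale 1-5
d <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 5))
d$alc_use  <- (d$Dalc + d$Walc) / 2   # average consumption
d$high_use <- d$alc_use > 2           # TRUE above the cutoff of 2
d$high_use  # FALSE TRUE TRUE
```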
Let's take a look at the names of the variables in our dataset:
alc <- read.csv("https://raw.githubusercontent.com/BreezewindX/IODS-project/master/data/alc.csv")
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
I'll choose high_use as the dependent variable and failures, absences and sex as explanatory variables.
My intuition is that
* a high use of alcohol is likely related to absences and also failures.
* drinking might lead to failing and failing might induce even more drinking
* absences might be an indicator of failures by themselves, even without high alcohol consumption, as there can be multiple reasons for absences.
* alcohol consumption habits might differ between men and women.
The logistic regression model is \[logit(p_i)=\log\frac{p_i}{1-p_i}=\beta_1+\beta_2failures_i+\beta_3absences_i+\beta_4sex_i\] where \(p_i\) is the probability that high_use is true for student \(i\).
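To make the link function concrete, the logit maps a probability to log-odds and its inverse maps back; a minimal sketch (the helper names are my own):

```r
# logit: probability -> log-odds; inv_logit: log-odds -> probability
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))
logit(0.5)             # 0: even odds
inv_logit(logit(0.3))  # recovers 0.3
```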
library(ggplot2)
ggplot(alc, aes(x = high_use, fill = sex)) +
geom_bar(position = "fill")
The majority of the students with high alcohol consumption were men.
library(ggpubr)
library(dplyr)
dens_score<- ggplot(alc, aes(x = G3)) +
geom_density()
x <- seq(1, 18, length.out=100)
df <- with(alc, data.frame(x = x, y = dnorm(x, mean(G3), sd(G3))))
dens_score2<-dens_score + geom_line(data = df, aes(x = x, y = y), color = "red")
dens_fail <- ggplot(alc, aes(x = failures, fill = high_use)) +
geom_bar(position="fill")
dens_abs <- ggplot(alc, aes(x = absences, fill = high_use)) +
geom_bar(position = "fill")
dens2_abs <- ggplot(alc, aes(x = absences, fill = high_use)) +
geom_histogram()
# facet_wrap(~ high_use)
ggarrange(dens_score2, dens_fail, dens_abs, dens2_abs,
labels = c("A", "B", "C", "D"),
ncol = 2, nrow = 2)
From A we can see that the scores are somewhat, even if not completely, normally distributed.
B shows the number of failures as a ratio between students with high and low alcohol consumption. We can see that the share of students with high alcohol use grows as the number of failures grows. The findings support the hypothesis that high alcohol consumption increases the risk of failure.
Also, in C the share of students with high alcohol consumption grows as the number of absences grows. This supports the view that students with high alcohol consumption have more absences than students with low alcohol consumption.
D is graph C in absolute counts, where we can see that the number of absences tails off towards the right. There are some outliers in the data.
library(ggplot2)
# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade")
The distribution of grades seems more symmetric among students with low alcohol consumption, and the grades of both sexes have more spread there than in the high-consumption case. Men's grades are clearly more affected by heavy consumption than women's, as the median grade for women is almost the same in both cases. For men, there are outliers who received bad grades regardless of how much they drink.
# initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")
Absences seem to increase when alcohol consumption is high, which makes sense, as heavy consumption has a tendency to make you feel like crap the next morning. Again, we can see that men's absences are more affected by drinking. Women have more absences than men when consuming little, and there is a pretty high number of absences among the outliers in the women's data.
library(GGally)
vars <- c("high_use","failures","absences","sex")
ggpairs(alc, columns = vars, mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist")))
Last but not least, we can see that there is not much correlation between failures and absences. The sexes were roughly equally represented in the data, though there were slightly more women.
# find the model with glm()
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1855 -0.8371 -0.6000 1.1020 2.0209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.90297 0.22626 -8.411 < 2e-16 ***
## failures 0.45082 0.18992 2.374 0.017611 *
## absences 0.09322 0.02295 4.063 4.85e-05 ***
## sexM 0.94117 0.24200 3.889 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 424.40 on 378 degrees of freedom
## AIC: 432.4
##
## Number of Fisher Scoring iterations: 4
# Round the coefficients for a cleaner look
clean <- round(coef(m), digits=3)
print(clean)
## (Intercept) failures absences sexM
## -1.903 0.451 0.093 0.941
mean(m$residuals)
## [1] 0.02143222
Here we can see the summary of the model. All explanatory variables are statistically significant at the 95% confidence level; absences and sex are significant even at the 99.9% level.
For every one unit change in failures, the log odds of high_use (versus low-use) increases by 0.451.
For every one unit change in absences, the log odds of high_use (versus low-use) increases by 0.093.
If the person in question is a male, the log odds of high_use increases by 0.941.
Therefore this fitted model says that, holding failures and absences fixed, the odds of high alcohol consumption for a male (sex = 1) relative to a female (sex = 0) are exp(0.941166) ≈ 2.563. In terms of percent change, the odds for men are about 156% higher than the odds for women. The coefficient for failures says that, holding absences and sex fixed, the odds of high alcohol consumption increase by about 57% for a one-unit increase in failures, since exp(0.4508198) ≈ 1.57. The absences variable has the same interpretation.
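These conversions from log-odds to odds ratios and percent changes are easy to verify; a sketch using the rounded coefficients from the summary above:

```r
# Rounded coefficients from the fitted model above
b <- c(failures = 0.451, absences = 0.093, sexM = 0.941)
odds_ratio <- exp(b)            # multiplicative effect on the odds
round(odds_ratio, 3)
round(100 * (odds_ratio - 1))   # percent change in the odds: 57 10 156
```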
Residuals of the regression are close to mean zero, which is what we would want.
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1491257 0.09395441 0.228611
## failures 1.5695984 1.08339644 2.294737
## absences 1.0977032 1.05169654 1.150848
## sexM 2.5629682 1.60381392 4.149405
Here we can see the odds ratios for the variables and their corresponding confidence intervals. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function.
Following is an excerpt from here.
An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0). For a given predictor (say x1), the associated beta coefficient (b1) in the logistic regression function corresponds to the log of the odds ratio for that predictor. If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times higher when the predictor x is present (x = 1) versus x is absent (x = 0).
From the odds ratios we have here we can say that it’s more likely that student has high alcohol consumption if there are failures present and if the student is a male the odds are over 2.5 times as high as if they were female.
Note that each odds ratio is simply the exponential of the corresponding regression coefficient.
The confidence interval can be interpreted as follows: if the sampling were repeated many times, about 95% of the intervals constructed this way would contain the true value of the coefficient. The confint function shows two-tailed intervals, from the 2.5% point to the 97.5% point of the relevant distribution, which form the lower and upper limits.
Let's recap my intuition from the beginning of this analysis:
It seems that failures and high use of alcohol are related. We can see this from the statistically significant coefficients of the regression.
Absences are also statistically significant, but their presence does not raise the odds that much. There are probably many other reasons for being absent than alcohol intake.
The data analysis would seem to support the view that men in the study group are more likely to have high alcohol consumption, and that with high consumption they also get lower grades and are absent more often.
Target variable versus the predictions (2x2) tabulation
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 259 9
## TRUE 84 30
We can see that the model predicted quite accurately when high_use was false: only 9 out of 268 predictions were wrong.
However, it did not do as well when high_use was true, as the prediction was wrong 84 times out of 114. A graphical representation follows.
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67801047 0.02356021 0.70157068
## TRUE 0.21989529 0.07853403 0.29842932
## Sum 0.89790576 0.10209424 1.00000000
Computing the total proportion of inaccurately classified individuals (= the training error):
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss <- loss_func(class = alc$high_use, prob = alc$probability)
print(loss)
## [1] 0.2434555
On average, our model predicts the outcome wrong in approx. 24% of the cases in the training data.
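The same training error can be recovered from the confusion matrix; a sketch with the counts copied from the 2x2 table above:

```r
# Counts copied from the 2x2 table above (matrix is filled column-wise)
tab <- matrix(c(259, 84, 9, 30), nrow = 2,
              dimnames = list(high_use   = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(tab)) / sum(tab)
c(accuracy = round(accuracy, 3), training_error = round(1 - accuracy, 3))
```

The training error of about 0.243 matches the loss function result.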
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2539267
The average proportion of wrong predictions in the cross-validation is approximately 0.25. My chosen model is exactly the same as in the DataCamp exercise, so of course the prediction error is the same. When comparing different models, we would prefer the one with the smaller penalty (i.e. error). If one could re-select the explanatory variables so that the error were smaller, that model could be preferred for prediction over this one. Note that adding more variables does not necessarily help: it can even increase the cross-validation error through overfitting.
library(MASS)
# load the data
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Boston dataframe has 506 observations of 14 variables:
* crim - per capita crime rate by town
* zn - proportion of residential land zoned for lots over 25,000 sq.ft.
* indus - proportion of non-retail business acres per town
* chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* nox - nitrogen oxides concentration (parts per 10 million)
* rm - average number of rooms per dwelling
* age - proportion of owner-occupied units built prior to 1940
* dis - weighted mean of distances to five Boston employment centres
* rad - index of accessibility to radial highways
* tax - full-value property-tax rate per $10,000
* ptratio - pupil-teacher ratio by town
* black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* lstat - lower status of the population (percent)
* medv - median value of owner-occupied homes in $1000s

More information about the dataset can be found here.
# plot matrix of the variables
pairs(Boston)
The pairs graph is a mess so far, so let's see if we can discard some variables and then maybe try drawing it again…
Let’s draw the correlations between the variables.
library("tidyverse")
library("dplyr")
library("corrplot")
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits=2)
# print the correlation matrix
#print(cor_matrix)
# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.cex = 0.7, number.cex = .8)
Let's list the variable pairs with the strongest correlations (absolute value above 0.7):
library(reshape2)
CM <- cor_matrix # Make a copy of the matrix
CM[lower.tri(CM, diag = TRUE)] <- NA # lower tri and diag set to NA
subset(melt(CM, na.rm = TRUE), abs(value) > .7)
## Var1 Var2 value
## 59 indus nox 0.76
## 89 nox age 0.73
## 101 indus dis -0.71
## 103 nox dis -0.77
## 105 age dis -0.75
## 129 indus tax 0.72
## 135 rad tax 0.91
## 195 lstat medv -0.74
Variables rad and tax are highly positively correlated (0.91). This can be interpreted so that areas with better access to the city's radial highways also have a higher property-tax rate.
Next come nox and dis with a negative correlation of -0.77. The interpretation is that the smaller the weighted average distance to the five Boston employment centres, the larger the nitrogen oxides concentration (worse air pollution).
Third, with a positive correlation of 0.76, come indus and nox. The interpretation is that the higher the proportion of non-retail business acres in a town, the higher the concentration of nitrogen oxides.
(One thing that can be noted is that the status of the inhabitants and the number of rooms correlate strongly with the housing value.)
For convenience, let’s plot histograms only for these variables of interest.
library(ggpubr)
library(ggplot2)
x <- seq(1, length.out=dim(Boston)[1])
rad_plot<-ggplot(Boston, aes(x=rad)) +
geom_histogram()
tax_plot <- ggplot(Boston, aes(x=tax)) +
geom_histogram()
nox_plot <- ggplot(Boston, aes(x=nox)) +
geom_histogram()
dis_plot <- ggplot(Boston, aes(x=dis)) +
geom_histogram()
indus_plot <- ggplot(Boston, aes(x=indus)) +
geom_histogram()
ggarrange(rad_plot, tax_plot, nox_plot, dis_plot, indus_plot,
labels = c("A", "B", "C", "D", "E"),
ncol = 3, nrow = 2)
Plot A: rad is the index of accessibility to radial highways. There are two concentrations in the graph, the first at index 4-5 and the second at index 24. There are 132 observations in the latter (index value 24), which is a high share of all the observations. It would make sense to check a map of Boston to better understand these outliers.
Plot B: tax shows the full-value property-tax rate, and the distribution is pretty similar to plot A, i.e. a concentration in the first quartile and a large number of outliers at the top (132 observations with value 666).
Plots C and D (nox and dis) also have similarly right-skewed distributions.
Plot E: indus is right-skewed and has a high count (132 observations) at a single value (18.1).
There seem to be 132 observations in the data that are outliers for some reason.
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
The scaled dataset is of type matrix, so we'll convert it into a data frame. In the new data frame all the variables have zero mean and unit variance.
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
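As a sanity check, scale() with the defaults just centres each column and divides by its standard deviation; a small sketch:

```r
# scale() = (x - mean(x)) / sd(x), column by column
x <- c(2, 4, 6, 8)
s <- as.vector(scale(x))
all.equal(s, (x - mean(x)) / sd(x))   # TRUE
c(mean = mean(s), sd = sd(s))         # 0 and 1 after scaling
```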
Linear discriminant analysis produces results based on the assumptions that the variables are normally distributed within each class and that the classes share the same covariance matrix.
Next we create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate), using the quantiles as the break points. We drop the old crime rate variable from the dataset and divide the data into train and test sets, so that 80% of the data belongs to the train set.
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low","med_high","high"))
# look at the table of the new factor crime
#table(crime)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
#lda.fit
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
From the graph we can see the variables that imply high crime, the most important factor being rad (the arrows pointing at the high crime cluster).
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 15 0 0
## med_low 9 8 4 0
## med_high 0 7 15 1
## high 0 0 0 25
The model seems to predict high crime very well (all 25 are correct), and looking at the predictions as a whole, it does better at the higher rates. In most classes the model predicted correctly more often than not, although med_low was misclassified more often than it was classified correctly.
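The overall accuracy can be read off the cross-tabulation; a sketch with the counts copied from the table above (note that the train/test split is random, so exact counts vary between runs):

```r
# Confusion matrix copied from the cross-tabulation above
conf <- rbind(low      = c(18, 15,  0,  0),
              med_low  = c( 9,  8,  4,  0),
              med_high = c( 0,  7, 15,  1),
              high     = c( 0,  0,  0, 25))
colnames(conf) <- rownames(conf)
round(sum(diag(conf)) / sum(conf), 3)  # overall accuracy: 0.647
```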
First we calculate the Euclidean distance matrix, and then the Manhattan distance matrix.
library(factoextra)
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- get_dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method="manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
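The difference between the two metrics is easiest to see on a single pair of points; a minimal sketch:

```r
# Euclidean vs. Manhattan distance between two points
p <- c(0, 0)
q <- c(3, 4)
sqrt(sum((p - q)^2))  # Euclidean: 5
sum(abs(p - q))       # Manhattan: 7
```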
############
# K-means #
############
k_max <- 10
set.seed(123)
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
From the graph we choose two centers, as that is the point where the slope of the line changes from steep to shallow (the elbow method). Let's run k-means with two centers.
# k-means clustering
set.seed(123)
km <-kmeans(boston_scaled, centers = 2)
library(tidyverse)
library(data.table)
boston_scaled %>%
mutate(Cluster = km$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean") %>%
DT::datatable(extensions = 'FixedColumns', options = list(dom = 't',
scrollX = TRUE,
scrollCollapse = TRUE))
You can scroll the datatable to see the rest of the variables, which are summarised by their means grouped by cluster number. There are clear differences between the two clusters: e.g. cluster #2 has higher crime, a bigger proportion of non-retail business acres, more pollution and older buildings; it lies further from the employment centres, has fewer teachers per pupil, and the median value of owner-occupied homes is lower.
Lets draw pairs according to the two clusters:
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
Red is cluster #2 and black is the nicer neighbourhood, cluster #1.
We can see from the graph what we earlier saw numerically from the datatable. In the upper row we have crime against the variables, and we can see that cluster #2 has higher crime rates in every aspect than cluster #1.
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[1:5], col = km$cluster)
If we compare crim and indus, we see that the red cluster has higher crime and is more industrialised than the black one. It's the same with nox: logically, there is also more air pollution. zn (the proportion of residential land zoned for large lots) tells us that the red cluster has little such zoning, unlike the black cluster.
We could go on analysing all the variables in the graph, but I'm certain that's not the point of this exercise.
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123)
k3 <- kmeans(boston_scaled, centers = 3)
k4 <- kmeans(boston_scaled, centers = 4)
k5 <- kmeans(boston_scaled, centers = 5)
k6 <- kmeans(boston_scaled, centers = 6)
k7 <- kmeans(boston_scaled, centers = 7)
# plots to compare
p2 <- fviz_cluster(k3, geom = "point", data = boston_scaled) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = boston_scaled) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = boston_scaled) + ggtitle("k = 5")
p5 <- fviz_cluster(k6, geom = "point", data = boston_scaled) + ggtitle("k = 6")
p6 <- fviz_cluster(k7, geom = "point", data = boston_scaled) + ggtitle("k = 7")
library(gridExtra)
grid.arrange(p2, p3, p4, p5, p6, nrow = 3)
I choose k = 3, as it gives the best separation in my opinion.
set.seed(123)
km <- boston_scaled %>%
  kmeans(centers = 3)
# fit LDA with the k-means clusters as target classes (lda() is in MASS)
lda.fit <- lda(km$cluster ~ ., data = boston_scaled)
classes <- as.numeric(km$cluster)
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 4)
rad and tax (and maybe indus) are the most influential linear separators for cluster #2, age and chas for cluster #1, and dis and rm for cluster #3. black pulls towards clusters #1 and #3. Selecting the right number of clusters seems to be important, so it's useful to know some methods for choosing the optimal number (for example the elbow method we used earlier).
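The `lda.arrows()` helper used above is not part of base R; a version along the lines of the one in the course materials is sketched below (the argument names are from that helper and should be checked against the original):

```r
# draw arrows for the variables on top of an LDA plot:
# 'x' is an lda fit, 'myscale' stretches the arrows,
# 'arrow_heads' sets the arrowhead length, 'choices' picks the discriminants
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
                       tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
```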
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123) #Setting seed
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))
# look at the table of the new factor crime
#table(crime)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
lda.fit <- lda(crime ~ ., data = train)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
set.seed(123)
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
myset <- boston_scaled[ind,]
km <- kmeans(myset, centers = 2)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)
We can see that one cluster contains approximately the high and med-high crime dots, and the other cluster the lower crime dots. There are no other differences between the two graphs, because I set the seed and the data is the same for both. I'm using the optimal two centers for the clusters here.
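The visual match between the two colourings can also be checked numerically by cross-tabulating the cluster labels against the crime categories of the same training rows (a sketch, assuming `crime` and `ind` from the chunks above are still in the workspace):

```r
# cross-tabulate the two k-means clusters against the crime
# quantile categories of the same (training) rows
table(cluster = km$cluster, crime = crime[ind])
```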
library(readr)
human <- read.csv("~/Documents/GitHub/IODS-project/data/human.csv", row.names=1)
str(human)
## 'data.frame': 155 obs. of 9 variables:
## $ country : Factor w/ 155 levels "Afghanistan",..: 105 6 134 41 101 54 67 149 27 102 ...
## $ edu2F : num 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
## $ labF : num 61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
## $ lifeExp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ expSchool : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ gni : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ mamo.ratio : int 4 6 6 5 6 7 9 28 11 8 ...
## $ adobirthrate: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ parlrep : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## country edu2F labF lifeExp
## Afghanistan: 1 Min. : 0.90 Min. :13.50 Min. :49.00
## Albania : 1 1st Qu.: 27.15 1st Qu.:44.45 1st Qu.:66.30
## Algeria : 1 Median : 56.60 Median :53.50 Median :74.20
## Argentina : 1 Mean : 55.37 Mean :52.52 Mean :71.65
## Armenia : 1 3rd Qu.: 85.15 3rd Qu.:61.90 3rd Qu.:77.25
## Australia : 1 Max. :100.00 Max. :88.10 Max. :83.50
## (Other) :149
## expSchool gni mamo.ratio adobirthrate
## Min. : 5.40 Min. : 581 Min. : 1.0 Min. : 0.60
## 1st Qu.:11.25 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65
## Median :13.50 Median : 12040 Median : 49.0 Median : 33.60
## Mean :13.18 Mean : 17628 Mean : 149.1 Mean : 47.16
## 3rd Qu.:15.20 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95
## Max. :20.20 Max. :123124 Max. :1100.0 Max. :204.80
##
## parlrep
## Min. : 0.00
## 1st Qu.:12.40
## Median :19.30
## Mean :20.91
## 3rd Qu.:27.95
## Max. :57.50
##
The human data frame has 155 observations of 9 variables:
country - country name
edu2F - proportion of females with at least secondary education
labF - proportion of females in the labour force
lifeExp - life expectancy at birth
expSchool - expected years of schooling
gni - gross national income per capita
mamo.ratio - maternal mortality ratio
adobirthrate - adolescent birth rate
parlrep - percentage of female representatives in parliament
Of these, edu2F and labF are proportional combined variables.
The database combines the Human Development Index and the Gender Inequality Index, which are part of the United Nations Development Programme's Human Development Reports.
More information about the dataset can be found here.
Technical notes can be found here.
This time there is no binary variable in the data. For some variables the mean and median differ, so their distributions seem skewed. We will see this better when we plot the variables next.
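The mean-median comparison behind that remark can be made explicit with a quick numeric check (a sketch, assuming the `human` data frame loaded above):

```r
# a large gap between mean and median hints at a skewed distribution
num_vars <- human[sapply(human, is.numeric)]
round(sapply(num_vars, mean) - sapply(num_vars, median), 2)
```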
Let’s start with histograms and density plots, followed by pairs and correlation graphs.
library(purrr)
library(tidyr)
library(ggplot2)
human %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
human %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_density()
Some of the variables, like expSchool, labF and parlrep, look like they might be normally distributed. edu2F might be trimodal. adobirthrate, gni and mamo.ratio are skewed to the right, and lifeExp (and to some extent expSchool) to the left.
Next, I'll create a categorical variable based on the quartiles of the GNI index, to see better how the variables differ by their GNI quartile.
library(GGally)
# create a categorical variable 'gni'
bins <- quantile(human$gni)
gnicat <- cut(human$gni, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))
ggpairs(dplyr::select(human, -country), mapping = aes(col = gnicat, alpha = 0.3), lower = list(combo = wrap("facethist")))
The order of the coloring is red, green, blue, purple from low to high scores in the GNI index.
library("tidyverse")
library("dplyr")
library("corrplot")
# calculate the correlation matrix and round it
cor_matrix<-cor(dplyr::select(human, -country)) %>% round(digits=2)
# print the correlation matrix
#print(cor_matrix)
# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.cex = 0.7, number.cex = .8)
Countries in the higher GNI quartiles have higher education shares for women than the lower ones. They also have higher expected years of schooling and life expectancy. Women's labour participation rate is at its highest in the quartile that scores lowest in GNI. There is surprisingly little difference between the four GNI quartiles in women's parliamentary participation rate (assuming these countries have parliaments). Only the fourth quartile shows something like a normal distribution for it; the others are skewed to the right.
Let's sum up the largest correlations between the variables (arranged in decreasing order of absolute value):
library(reshape2)
CM <- cor_matrix # Make a copy of the matrix
CM[lower.tri(CM, diag = TRUE)] <- NA # lower tri and diag set to NA
corr_list<-subset(melt(CM, na.rm = TRUE), abs(value) > .7)
corr_list %>% arrange(desc(abs(value))) -> corr_list2
print(corr_list2)
## Var1 Var2 value
## 1 lifeExp mamo.ratio -0.86
## 2 lifeExp expSchool 0.79
## 3 edu2F expSchool 0.78
## 4 mamo.ratio adobirthrate 0.76
## 5 expSchool mamo.ratio -0.74
## 6 lifeExp adobirthrate -0.73
## 7 edu2F lifeExp 0.72
## 8 edu2F mamo.ratio -0.72
Life expectancy has a strong negative correlation (-0.86) with maternal mortality ratio, which is pretty self-explanatory.
Life expectancy has a strong positive correlation with expected years of schooling (0.79); this is logical, as when life expectancy is higher, parents value schooling more as an investment.
The proportion of females with at least secondary education correlates positively (0.78) with expected years of schooling (naturally, as a higher share of the population attends school, the population average rises too).
Next is the positive correlation (0.76) between maternal mortality ratio and adolescent birth rate (births per 1,000 women aged 15-19): the higher the adolescent birth rate, the more maternal deaths there tend to be.
As a last remark, women's education share seems to go together with higher life expectancy and a smaller maternal mortality ratio.
First we perform PCA on the non-standardised data.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(dplyr::select(human, -country))
print(pca_human)
## Standard deviations (1, .., p=8):
## [1] 18544.172057 186.283543 25.972416 20.074641 14.321634
## [6] 10.634338 3.721257 1.428190
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## edu2F -9.317458e-04 0.085169068 -0.3652346210 -0.8435797499
## labF 9.124960e-05 -0.031442122 -0.0180958993 -0.3928157700
## lifeExp -2.815825e-04 0.028224407 -0.0170864406 -0.0212996883
## expSchool -9.562916e-05 0.007556395 -0.0210421918 -0.0308785262
## gni -9.999828e-01 -0.005830843 0.0006412388 0.0002381021
## mamo.ratio 5.655739e-03 -0.987500960 -0.1481355205 -0.0173448186
## adobirthrate 1.233962e-03 -0.125305410 0.9184673154 -0.3475453954
## parlrep -5.526462e-05 0.003211742 -0.0038388487 -0.1075784134
## PC5 PC6 PC7 PC8
## edu2F -3.770012e-01 -6.083913e-02 0.0303039622 3.118655e-02
## labF 8.233860e-01 4.077784e-01 -0.0093667016 6.654689e-03
## lifeExp 1.136966e-02 -6.669649e-02 -0.9901494274 1.161211e-01
## expSchool 3.896982e-03 -2.437473e-02 -0.1134676761 -9.925031e-01
## gni 6.651486e-06 2.062449e-05 0.0001028639 1.871381e-05
## mamo.ratio -3.955532e-02 -2.010447e-02 -0.0244043639 -7.101321e-04
## adobirthrate -1.381061e-01 -2.499459e-02 -0.0127951595 -8.079546e-03
## parlrep 3.989026e-01 -9.077136e-01 0.0704544742 1.925677e-02
# draw a biplot of the principal component representation and the original variables
s <- summary(pca_human)
print(s)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 186.2835 25.97 20.07 14.32 10.63 3.721
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.00 0.00 0.000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.00 1.00 1.000
## PC8
## Standard deviation 1.428
## Proportion of Variance 0.000
## Cumulative Proportion 1.000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
Let's check the percentages of variance of the principal components and make the biplot.
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot1<-biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## (the same warning is repeated three more times, once per skipped arrow)
It seems that without standardising the data, one principal component explains practically all of the variation (gni dominates because its scale is much larger than that of the other variables), and the graph does not tell us much.
Let’s now standardise the human data.
human_std <- scale(dplyr::select(human, -country))
# summaries of the scaled variables
summary(human_std)
## edu2F labF lifeExp expSchool
## Min. :-1.76122 Min. :-2.43186 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.91250 1st Qu.:-0.50289 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.03967 Median : 0.06116 Median : 0.3056 Median : 0.1140
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.96275 3rd Qu.: 0.58469 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 1.44288 Max. : 2.21762 Max. : 1.4218 Max. : 2.4730
## gni mamo.ratio adobirthrate parlrep
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# change the object to data frame
human_std <- as.data.frame(human_std)
Then we perform PCA (principal component analysis) on the standardised data.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
print(pca_human)
## Standard deviations (1, .., p=8):
## [1] 2.1359720 1.1170602 0.8767066 0.7202736 0.5530904 0.5299143 0.4493522
## [8] 0.3372776
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## edu2F -0.39874523 0.101737848 -0.210481695 -0.31033900 0.43437742
## labF 0.14439668 0.682158952 -0.575062704 -0.27300332 -0.22413576
## lifeExp -0.43080653 0.003997077 0.073302641 -0.02783772 0.04920168
## expSchool -0.41858363 0.138149663 -0.073869337 -0.07719277 0.30831448
## gni -0.33914119 0.098797891 -0.338769060 0.84020987 -0.01249472
## mamo.ratio 0.41991628 0.123208094 -0.145471957 0.28520785 0.05493343
## adobirthrate 0.40271826 0.088902996 -0.003213286 0.11830579 0.81248359
## parlrep -0.07626861 0.687286162 0.691544283 0.14537131 -0.01724555
## PC6 PC7 PC8
## edu2F -0.50478735 0.48033587 0.12578655
## labF 0.23657515 0.01076780 -0.04754207
## lifeExp 0.57970641 0.04122211 0.68415054
## expSchool -0.11017948 -0.81713841 -0.13919229
## gni 0.01564322 0.16310711 -0.16583233
## mamo.ratio -0.43780233 -0.24286926 0.67254029
## adobirthrate 0.36928738 0.07464930 -0.11761127
## parlrep -0.11284562 0.09265604 -0.02894539
# draw a biplot of the principal component representation and the original variables
s <- summary(pca_human)
print(s)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.1360 1.1171 0.87671 0.72027 0.55309 0.5299
## Proportion of Variance 0.5703 0.1560 0.09608 0.06485 0.03824 0.0351
## Cumulative Proportion 0.5703 0.7263 0.82235 0.88720 0.92544 0.9605
## PC7 PC8
## Standard deviation 0.44935 0.33728
## Proportion of Variance 0.02524 0.01422
## Cumulative Proportion 0.98578 1.00000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
Let's check the percentages of variance captured by the principal components and make a biplot.
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 57.0 15.6 9.6 6.5 3.8 3.5 2.5 1.4
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot2<-biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
Figure A: Women’s participation in society and childbearing
The results are now different because PCA is sensitive to the relative scaling of the original features, so standardising the data is a good idea.
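This scale sensitivity is easy to demonstrate on toy data (a sketch using `prcomp` with and without its `scale.` argument; the variable names here are made up):

```r
# two uncorrelated variables, one with a much larger scale
set.seed(1)
toy <- data.frame(a = rnorm(100), b = 1000 * rnorm(100))
# without scaling, PC1 is essentially just 'b'
round(summary(prcomp(toy))$importance[2, ], 3)
# with scaling, the variance splits roughly evenly between the PCs
round(summary(prcomp(toy, scale. = TRUE))$importance[2, ], 3)
```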
My interpretation of PC#1 is that it describes the relation between education and birth rate / maternal mortality: when families have fewer children who live longer, education is seen as more and more important. General education is also likely to improve women's survival of childbirth.
My interpretation of PC#2 is that it describes women's participation in society, for example representation in parliament or participation in the labour market. The principal component seems to be an aggregate of the two variables parlrep and labF, and no variable seems to have the opposite effect. I think this is because these variables are shares computed from several variables, which we combined in the data wrangling section.
For example, in the top left corner we have the Scandinavian countries with high schooling and high participation ratios for women, and at the bottom we find countries where women are known to be excluded from national politics, like Iran, Iraq and Pakistan. These countries are still relatively advanced, as we can see from the maternal mortality rate. Some African countries on the right have quite high participation ratios for women, but their maternal mortality rates are also quite high.
For the tea data, 300 tea consumers were interviewed about their consumption of tea. The questions covered how they consume tea, what they think of tea, and background questions (sex, age, socio-professional category and sport practice). 122 males and 178 females participated, aged 15 to 90, with a median age of 32 years and a mean of 37. We have categorical, qualitative and quantitative variables in the data. For age, the data set has two different variables: a continuous and a categorical one. The data is part of the FactoMineR package for R, and more information can be found on the FactoMineR homepage.
library(FactoMineR)
data(tea)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "age_Q")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
## where age_Q
## chain store :192 15-24:92
## chain store+tea shop: 78 25-34:69
## tea shop : 30 35-44:40
## 45-59:61
## +60 :38
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ age_Q: Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
We have selected columns Tea, How, how, sugar, where and age_Q from the data.
After this selection, the dataset tea_time has 300 observations of 6 variables, all factors.
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
From the bar plot we can read the age distribution by categorical division. Note that the first age group in the graph is +60 (it would probably be last if it were named "60+"). Most people use tea bags, and users of unpackaged tea are a minority. Tea is likely bought in bags from a chain store, as the proportions are similar. The tea is usually drunk plain; of the additions, sugar is the most common, though around half of the drinkers do not use sugar at all. The most popular tea type is Earl Grey, which is consumed about 2.5 times as often as regular black tea.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.313 0.266 0.249 0.205 0.184 0.175
## % of var. 13.434 11.404 10.689 8.791 7.883 7.511
## Cumulative % of var. 13.434 24.838 35.527 44.318 52.201 59.712
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.168 0.147 0.141 0.135 0.119 0.101
## % of var. 7.218 6.302 6.051 5.787 5.086 4.348
## Cumulative % of var. 66.930 73.231 79.282 85.069 90.155 94.503
## Dim.13 Dim.14
## Variance 0.072 0.056
## % of var. 3.094 2.403
## Cumulative % of var. 97.597 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.068 0.005 0.002 | 0.048 0.003 0.001 | -0.471
## 2 | 0.142 0.021 0.009 | 0.089 0.010 0.004 | -1.069
## 3 | -0.199 0.042 0.033 | -0.254 0.081 0.053 | -0.600
## 4 | -0.798 0.678 0.665 | -0.240 0.072 0.060 | 0.064
## 5 | -0.199 0.042 0.033 | -0.254 0.081 0.053 | -0.600
## 6 | -0.582 0.360 0.361 | -0.167 0.035 0.030 | -0.235
## 7 | -0.190 0.039 0.022 | -0.036 0.002 0.001 | -0.411
## 8 | 0.150 0.024 0.009 | 0.307 0.118 0.036 | -0.880
## 9 | 0.268 0.076 0.026 | 0.900 1.016 0.290 | 0.175
## 10 | 0.605 0.390 0.137 | 0.870 0.948 0.282 | -0.073
## ctr cos2
## 1 0.296 0.106 |
## 2 1.527 0.527 |
## 3 0.481 0.297 |
## 4 0.005 0.004 |
## 5 0.481 0.297 |
## 6 0.074 0.059 |
## 7 0.226 0.103 |
## 8 1.035 0.298 |
## 9 0.041 0.011 |
## 10 0.007 0.002 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.750 7.377 0.184 7.421 | 0.467 3.365 0.071
## Earl Grey | -0.389 5.179 0.273 -9.036 | -0.016 0.010 0.000
## green | 0.594 2.063 0.044 3.610 | -0.954 6.272 0.113
## alone | -0.096 0.316 0.017 -2.253 | -0.262 2.803 0.128
## lemon | 0.483 1.366 0.029 2.938 | 0.202 0.282 0.005
## milk | -0.091 0.093 0.002 -0.813 | 0.315 1.307 0.026
## other | 0.938 1.404 0.027 2.853 | 2.736 14.071 0.232
## tea bag | -0.460 6.376 0.277 -9.096 | -0.154 0.842 0.031
## tea bag+unpackaged | 0.174 0.504 0.014 2.031 | 0.719 10.150 0.236
## unpackaged | 1.718 18.837 0.403 10.972 | -1.150 9.948 0.180
## v.test Dim.3 ctr cos2 v.test
## black 4.618 | -0.740 9.026 0.179 -7.322 |
## Earl Grey -0.367 | 0.334 4.788 0.201 7.751 |
## green -5.800 | -0.293 0.629 0.011 -1.778 |
## alone -6.183 | -0.047 0.098 0.004 -1.118 |
## lemon 1.229 | 1.264 11.746 0.197 7.685 |
## milk 2.811 | -0.379 2.011 0.038 -3.375 |
## other 8.321 | -0.957 1.836 0.028 -2.910 |
## tea bag -3.046 | -0.436 7.208 0.249 -8.627 |
## tea bag+unpackaged 8.400 | 0.663 9.193 0.200 7.740 |
## unpackaged -7.346 | 0.330 0.873 0.015 2.107 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.275 0.154 0.216 |
## How | 0.060 0.295 0.235 |
## how | 0.484 0.334 0.259 |
## sugar | 0.132 0.013 0.200 |
## where | 0.528 0.632 0.233 |
## age_Q | 0.402 0.169 0.354 |
From the summary we can see that for the category milk the test value is smaller in absolute value than the critical value of 1.96, so we cannot say that its coordinate is significantly different from zero with respect to Dim 1.
With respect to Dim 2, the same goes for the categories Earl Grey and lemon.
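The same test values can be pulled out of the MCA result directly, without reading the long summary (a sketch, assuming the `mca` object fitted above; `var$v.test` is the category test-value matrix in FactoMineR's MCA output):

```r
# test values of the category coordinates for the first two dimensions;
# |v.test| < 1.96 means the coordinate is not significantly non-zero
round(mca$var$v.test[, 1:2], 2)
```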
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
The distance between variable categories gives a measure of their similarity. Here Dim 1 explains 13.43% of the variation and Dim 2 explains 11.40%. Neither dimension explains a majority of the variation, and the percentages are quite close to each other.
After plotting the MCA factor map, we can clearly see different groups formed:
unpackaged tea is related to tea shops,
teabag+unpackaged is related to chain store+tea shop,
teabags are related to chain stores.
Sugar and milk are added more often to bag tea than to unpackaged tea, and more often to Earl Grey than to black or green tea. Lemon is added mostly to unpackaged tea. "Other" is a curious case, loading strongly on Dim 2.
dimdesc(mca)
## $`Dim 1`
## $`Dim 1`$quali
## R2 p.value
## where 0.52835508 3.402919e-49
## how 0.48367544 2.339016e-43
## age_Q 0.40178129 7.360334e-32
## Tea 0.27495543 1.837208e-21
## sugar 0.13225028 8.232202e-11
## How 0.05978394 3.854850e-04
##
## $`Dim 1`$category
## Estimate p.value
## unpackaged 0.69474080 3.340834e-35
## tea shop 0.69465343 1.212375e-32
## black 0.24172246 7.129309e-15
## No.sugar 0.20372089 8.232202e-11
## +60 0.38974921 2.728687e-09
## green 0.15432353 2.710546e-04
## 45-59 0.11640619 5.815735e-04
## lemon 0.09776924 3.157531e-03
## other 0.35244199 4.165452e-03
## 35-44 0.13205315 4.331606e-03
## tea bag+unpackaged -0.16991252 4.203783e-02
## alone -0.22634169 2.399737e-02
## chain store+tea shop -0.09539809 6.112076e-06
## sugar -0.20372089 8.232202e-11
## Earl Grey -0.39604598 2.004995e-22
## tea bag -0.52482827 9.453587e-23
## 15-24 -0.60392196 3.365056e-30
## chain store -0.59925534 2.883891e-33
##
##
## $`Dim 2`
## $`Dim 2`$quali
## R2 p.value
## where 0.6318637 3.566828e-65
## how 0.3343163 5.692320e-27
## How 0.2947635 2.694094e-22
## Tea 0.1540126 1.635070e-11
## age_Q 0.1689345 3.629091e-11
##
## $`Dim 2`$category
## Estimate p.value
## chain store+tea shop 0.70929060 7.770361e-47
## tea bag+unpackaged 0.47161067 3.582736e-19
## other 1.02576718 8.526004e-19
## black 0.32725469 2.712739e-06
## +60 0.31029257 3.430996e-06
## 35-44 0.19254061 1.476821e-03
## tea bag 0.02118156 2.197206e-03
## milk -0.22316139 4.765518e-03
## 25-34 -0.33024542 1.272642e-07
## green -0.40562971 2.539273e-09
## chain store -0.03682077 2.085882e-09
## alone -0.52113722 1.777027e-10
## unpackaged -0.49279223 1.415218e-14
## tea shop -0.67246984 5.949668e-20
##
##
## $`Dim 3`
## $`Dim 3`$quali
## R2 p.value
## age_Q 0.3536548 5.876798e-27
## how 0.2585093 5.136908e-20
## where 0.2330683 7.697792e-18
## How 0.2348039 4.237186e-17
## Tea 0.2161434 1.968505e-16
## sugar 0.2003056 3.497087e-16
##
## $`Dim 3`$category
## Estimate p.value
## 25-34 0.45976930 2.116883e-16
## Earl Grey 0.28300167 3.119648e-16
## tea bag+unpackaged 0.23830348 3.465086e-16
## sugar 0.22363951 3.497087e-16
## lemon 0.64614264 5.936838e-16
## chain store+tea shop 0.15382682 3.249970e-11
## tea shop 0.18552512 5.921016e-05
## 15-24 0.16779074 5.659078e-03
## unpackaged 0.07220445 3.485880e-02
## +60 -0.15408157 7.466100e-03
## other -0.46305098 3.461660e-03
## milk -0.17423380 6.751923e-04
## black -0.25323911 1.756827e-14
## 45-59 -0.37793254 5.344405e-15
## No.sugar -0.22363951 3.497087e-16
## chain store -0.33935194 6.755202e-19
## tea bag -0.31050793 2.755334e-20
Using dimdesc we can see that where you buy your tea and how it is packaged are strongly related to Dim 1. Next come age_Q and the tea type.
For Dim 2 the list is somewhat different: where (chain store or tea shop), how (packaged or not) and How (whether they add lemon, milk or something else to the tea).
I found describing Dim 2 somewhat problematic, as the group seems to fall between the two extremes of buying bagged tea from chain stores and buying unpackaged tea from tea shops, with "other" being the most important category. What this "other" that people add to their tea is, I'm not sure; maybe soy milk or something similar that does not fit under "milk". I could not find a description for the variable on the home page, so perhaps only the respondents know what it means.
My interpretation is that people scoring high on Dim 1 are those who prefer unpackaged tea, buy it from a tea shop, drink it black without sugar, and sometimes add a slice of lemon or drink green tea. For the negative estimates in the list, the interpretation is reversed: an individual with those characteristics is less likely to belong to this group.
We also observe that people in this group tend to be older, which we also saw in the MCA plot earlier.
Let’s try out one extra plotting function of FactoMineR.
plotellipses(mca, keepvar = 1:5)
By drawing the confidence ellipses we can see that they are quite small and don't overlap, so the sub-populations here are pretty distinct.